The HIPPIE database integrates multiple experimental PPI datasets and, for a given protein pair, produces a confidence value that the interaction exists. One concern with using it is that the gold standard dataset currently chosen for this project (DIP) is one of the datasets integrated into HIPPIE. This could mean that the classifier can simply rely on this feature to predict the training or test set perfectly.

If that turns out to be a problem, the solution is simply to threshold the HIPPIE dataset at a high confidence value and use that as the gold standard dataset instead (a rough sketch of this is given just below). This might be a good idea regardless of the results of the test.
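
As a rough sketch, the thresholding could look something like the following. This assumes the tab-separated layout of hippie_current.txt (Entrez IDs in the second and fourth columns, confidence score in the fifth, as in the query output further down) with no header line; the 0.7 cutoff and the output filename are purely illustrative.

#sketch only: derive a high-confidence pair list from the HIPPIE flat file
import csv
threshold = 0.7
rc = csv.reader(open("hippie_current.txt"), delimiter="\t")
wc = csv.writer(open("hippie.highconfidence.Entrez.txt", "w"), delimiter="\t")
for line in rc:
    if float(line[4]) >= threshold:
        wc.writerow([line[1], line[3]])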

The aim of this notebook is to:

  1. Create a file containing the HIPPIE confidence values for each protein pair in our gold standard dataset and active zone network.
  2. See if the predictions of the HIPPIE dataset are the same as the DIP results.

Reformatting the lists

The complete list of protein pairs is spread across a few different files, which need to be reformatted before they can be submitted to HIPPIE's web service. First, we need to strip the class labels (the zeros and ones) from the two gold standard training files:

  1. training.negative.Entrez.txt
  2. training.positive.Entrez.txt

The new files will be called:

  1. training.nolabel.negative.Entrez.txt
  2. training.nolabel.positive.Entrez.txt

In [1]:
cd /home/gavin/Documents/MRes/DIP/human/


/home/gavin/Documents/MRes/DIP/human

In [2]:
ls


DIPtouniprot.tab  flat.Entrez.txt   Hsapi20140427.txt    interacting.Entrez.txt   training.negative.Entrez.txt  uniprottoEntrez.tab
flat.DIP.txt      flat.uniprot.txt  interacting.DIP.txt  interacting.uniprot.txt  training.positive.Entrez.txt

In [6]:
import csv

In [8]:
rc = csv.reader(open("training.negative.Entrez.txt"), delimiter="\t")
wc = csv.writer(open("training.nolabel.negative.Entrez.txt", "w"), delimiter="\t")
for line in rc:
    line = (line[0],line[1])
    wc.writerow(line)

In [10]:
rc = csv.reader(open("training.positive.Entrez.txt"), delimiter="\t")
wc = csv.writer(open("training.nolabel.positive.Entrez.txt", "w"), delimiter="\t")
for line in rc:
    line = (line[0],line[1])
    wc.writerow(line)

Bait and prey combinations

All protein pairs from the bait and prey lists need to be placed in a file. First, load in the bait and prey Entrez identifiers:


In [16]:
cd /home/gavin/Documents/MRes/forGAVIN/pulldown_data/BAITS/


/home/gavin/Documents/MRes/forGAVIN/pulldown_data/BAITS

In [17]:
ls


baitDIPcrossover.Entrez.txt  baits.csv  baits_entrez_ids_ActiveZone.csv  baits_entrez_ids.csv  ensembl_bait_human_ids.csv

In [18]:
baitids = list(flatten(csv.reader(open("baits_entrez_ids.csv"))))

In [19]:
cd /home/gavin/Documents/MRes/forGAVIN/pulldown_data/PREYS/


/home/gavin/Documents/MRes/forGAVIN/pulldown_data/PREYS

In [20]:
preyids = list(flatten(csv.reader(open("prey_entrez_ids.csv"))))

In [21]:
#combine these lists
pulldownids = baitids + preyids

At this point we have a list pulldownids of Entrez IDs for all proteins found in the pulldown experiments.

We can now use that list to build a set of all possible combinations using itertools:


In [22]:
import itertools

In [23]:
#initialise list
pulldowncomb = []
#iterate over all possible combinations, adding to the list
for pair in itertools.combinations(pulldownids,2):
    pulldowncomb.append(frozenset(pair))
#convert list to set
pulldowncomb = set(pulldowncomb)

How many combinations are there?


In [24]:
print "Number of combinations of pulldown protein IDs: %i"%(len(pulldowncomb))


Number of combinations of pulldown protein IDs: 1699335
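
As a quick sanity check, this count can be at most len(pulldownids) choose 2; duplicate IDs shared between the bait and prey lists are collapsed by the frozenset/set construction above, so it can come out lower.

#upper bound on the number of distinct unordered pairs
n = len(pulldownids)
print "Upper bound (n choose 2): %i"%(n*(n-1)/2)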

Saving this set to a file. The file will be named pulldown.combinations.Entrez.txt.


In [30]:
csv.writer(open("pulldown.combinations.Entrez.txt", "w"), delimiter="\t").writerows(map(lambda x: list(x) ,list(pulldowncomb)))

Using HIPPIE command line tool

For large sets of protein pairs, the HIPPIE web service redirects to the command line tool. The tool itself and its readme can both be downloaded from the HIPPIE site, as in the wget commands below.


In [32]:
cd /home/gavin/Documents/MRes/HIPPIE/


/home/gavin/Documents/MRes/HIPPIE

To retrieve the most recent version of the HIPPIE dataset, script and readme:


In [35]:
%%bash
wget -q http://cbdm.mdc-berlin.de/tools/hippie/hippie_current.txt
wget -q http://cbdm.mdc-berlin.de/tools/hippie/NC/HIPPIE_NC.jar
wget -q http://cbdm.mdc-berlin.de/tools/hippie/NC/README.txt

The quick start guide gives the usage for reading an input file when the HIPPIE database is in the same directory and named hippie_current.txt:

java -jar HIPPIE_NC.jar -i=query.txt

We would also like to specify an output file, which is done using:

java -jar HIPPIE_NC.jar -i=query.txt -o=out.txt

Also, we should probably specify that the proteins will be given in Entrez format:

java -jar HIPPIE_NC.jar -i=query.txt -t=e -o=out.txt

There is also an option to restrict HIPPIE to only the proteins it is supplied; otherwise it searches for interactions between the supplied proteins and all the proteins it knows about. This is the layer option, which must be set to zero:

java -jar HIPPIE_NC.jar -i=query.txt -l=0 -t=e -o=out.txt

The files which must be queried using this tool are given below:

  1. training.nolabel.negative.Entrez.txt - negative training examples from gold standard dataset.
  2. training.nolabel.positive.Entrez.txt - positive training examples from gold standard dataset.
  3. pulldown.combinations.Entrez.txt - all possible combinations of proteins from pulldown experiments.

Starting with the smallest file, which is training.nolabel.positive.Entrez.txt:


In [45]:
%%bash
java -jar HIPPIE_NC.jar -i=../DIP/human/training.nolabel.positive.Entrez.txt -t=e -l=0 -o=training.positive.HIPPIE.txt

Looking at the file and checking to see if all pairs were mapped:


In [49]:
%%bash
head training.positive.HIPPIE.txt
wc -l ../DIP/human/training.positive.Entrez.txt
wc -l training.positive.HIPPIE.txt


	3783709		7874	0		1
GCR_HUMAN	2908	ADA_HUMAN	100	0.85	experiments:in vitro,Reconstituted Complex,pull down;pmids:9154805,16189514;sources:HPRD,BioGRID,I2D,Rual05	0
	10111		50484	0		1
TP53B_HUMAN	7158	CDC16_HUMAN	8881	0.63	experiments:affinity chromatography technology;pmids:22990118;sources:BioGRID	0
	3020		3021	0		1
GRB2_HUMAN	2885	HSPB1_HUMAN	3315	0.56	experiments:pull down;pmids:12577067;sources:IntAct	0
TPOR_HUMAN	4352	PTN11_HUMAN	5781	0.52	experiments:in vivo;pmids:8541543;sources:HPRD,I2D	0
CAV1_HUMAN	857	ERBB2_HUMAN	2064	0.55	experiments:in vitro,in vivo;pmids:9685399;sources:HPRD,I2D	0
1433Z_HUMAN	7534	ATPA_HUMAN	498	0.89	experiments:in vivo,coimmunoprecipitation,gst pull down,affinity chromatography technology,pull down,tandem affinity purification;pmids:15324660,15161933,20618440;sources:HPRD,MINT,I2D,BioGRID,IntAct	0
PAN2_HUMAN	9924	RUVB2_HUMAN	10856	0.63	experiments:affinity chromatography technology;pmids:23398456;sources:BioGRID	0
5103 ../DIP/human/training.positive.Entrez.txt
43416 training.positive.HIPPIE.txt

There are many more pairs after conversion because the HIPPIE script simply takes the proteins as a list and returns all the interactions it knows about between those proteins. As this includes not just the DIP dataset but many others, a larger number of interactions is returned when no cutoff is set. To deal with this, we require a script to match the confidence values in the file produced against only the interacting pairs we care about.


In [50]:
cd /home/gavin/Documents/MRes/DIP/human/


/home/gavin/Documents/MRes/DIP/human

In [51]:
#initialise csv reader
c = csv.reader(open("training.nolabel.positive.Entrez.txt"), delimiter="\t")
#make dictionary using frozensets as keys:
posids = {}
for line in c:
    line = frozenset(line)
    posids[line] = 1

In [52]:
cd /home/gavin/Documents/MRes/HIPPIE/


/home/gavin/Documents/MRes/HIPPIE

In [55]:
#initialise csv reader
c = csv.reader(open("training.positive.HIPPIE.txt"), delimiter="\t")
#make dictionary using frozensets as keys with the confidence scores as values
hippieids = {}
for line in c:
    k = frozenset([line[1],line[3]])
    hippieids[k] = line[4]
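
The cell that writes the matched pairs back out is not shown above. A minimal sketch of that step, assuming pairs with no entry in the HIPPIE output are simply dropped and the rest are written back to training.positive.HIPPIE.txt with their confidence scores, would be:

#sketch only: write each gold standard pair found in the HIPPIE output,
#together with its confidence score, overwriting training.positive.HIPPIE.txt
wc = csv.writer(open("training.positive.HIPPIE.txt", "w"), delimiter="\t")
for pair in posids:
    if pair in hippieids:
        wc.writerow(list(pair) + [hippieids[pair]])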

In [57]:
%%bash
head training.positive.HIPPIE.txt
wc -l ../DIP/human/training.positive.Entrez.txt
wc -l training.positive.HIPPIE.txt


5071	65018	0.82
29102	1655	0.14
5914	16475	0
6883	6880	0.87
2099	396442	0
8330	27000	0
2244	1497146	0
166379	582	0.63
5216	850504	0
4792	4793	0.94
5103 ../DIP/human/training.positive.Entrez.txt
4752 training.positive.HIPPIE.txt

Strangely, we have dropped some pairs in the conversion, which doesn't make a lot of sense. It could be that the HIPPIE database is not up to date with the pairs in the DIP database, or that the method HIPPIE uses to map DIP protein identifiers to Entrez differs from ours.

Also odd is the fact that some of these protein pairs have a confidence value of zero attached to them. At this point it is uncertain why. It could be that these pairs were known to be missing from the HIPPIE database, so the HIPPIE script has labelled them zero; or it could be that the script assigns no confidence to the evidence for these interactions from DIP.

In any case, the two databases clearly do not match up exactly. Whether to use the HIPPIE database as the gold standard instead of DIP will have to be discussed; either way, the code will be designed so that the gold standard database can easily be switched.

The above code was modified into a simple Python script called hippiematch.py to use on the remaining two files. Repeating the above for the final two files can then be done as follows:


In [68]:
%%bash
python2 ../opencast-bio/scripts/hippiematch.py -h


Usage: python2 hippiematch.py hippiefile pairfile outputfile
Where: 
    hippiefile is the output of the HIPPIE script
    pairfile is the file of protein pairs to match
    outputfile is the name of the file to output to
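
The actual script is in the opencast-bio repository; as a rough sketch only, based on the matching code above and assuming that unmatched pairs are dropped and a match count is printed, it might look something like this:

#sketch only, not the actual hippiematch.py
import csv
import sys

def main(hippiefile, pairfile, outputfile):
    #dictionary of HIPPIE confidence scores keyed by frozensets of Entrez IDs
    hippieids = {}
    for line in csv.reader(open(hippiefile), delimiter="\t"):
        hippieids[frozenset([line[1], line[3]])] = line[4]
    #read the pairs of interest, then write out those HIPPIE knows about
    pairs = [frozenset(l) for l in csv.reader(open(pairfile), delimiter="\t")]
    wc = csv.writer(open(outputfile, "w"), delimiter="\t")
    matched = 0
    for pair in pairs:
        if pair in hippieids:
            wc.writerow(list(pair) + [hippieids[pair]])
            matched += 1
    print "%i of %i pairs matched."%(matched, len(pairs))

if __name__ == "__main__":
    main(sys.argv[1], sys.argv[2], sys.argv[3])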

In [73]:
%%bash
java -jar HIPPIE_NC.jar -i=../DIP/human/training.nolabel.negative.Entrez.txt -t=e -l=0 -o=training.negative.HIPPIE.txt
python2 ../opencast-bio/scripts/hippiematch.py training.negative.HIPPIE.txt ../DIP/human/training.nolabel.negative.Entrez.txt training.negative.HIPPIE.txt


3060531 of 3060684 pairs matched.

In [74]:
%%bash
java -jar HIPPIE_NC.jar -i=../forGAVIN/pulldown_data/PREYS/pulldown.combinations.Entrez.txt -t=e -l=0 -o=pulldown.combinations.HIPPIE.txt
python2 ../opencast-bio/scripts/hippiematch.py pulldown.combinations.HIPPIE.txt ../forGAVIN/pulldown_data/PREYS/pulldown.combinations.Entrez.txt pulldown.combinations.HIPPIE.txt


1698968 of 1699335 pairs matched.